AITopics

2607.0251

Country: Asia (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Balasubramanian, Krishnakumar, Banerjee, Sayan, Korba, Anna

Uniform-in-time Propagation-of-Chaos for Stein Variational Gradient Descent

We study uniform-in-time propagation-of-chaos for continuous-time Stein Variational Gradient Descent (SVGD). Classical finite-time propagation-of-chaos estimates for mean-field systems typically deteriorate rapidly with time and therefore do not directly explain the long-time relation between the finite-particle system and its mean-field limit. We obtain two complementary classes of uniform-in-time propagation-of-chaos results. For broad distributional metrics, we introduce a cutoff strategy which combines finite-time propagation-of-chaos estimates up to an $N$-dependent horizon with independent quantitative long-time convergence estimates for the finite-particle and mean-field SVGD flows. This yields uniform-in-averaging-time propagation-of-chaos bounds in Langevin kernel Stein discrepancy, Wasserstein-1 distance, and Wasserstein-2 distance, with logarithmic or iterated-logarithmic rates depending on the metric, target and kernel class. We also develop a finite-dimensional theory for matrix-valued finite-rank kernels. For Gaussian targets with bilinear kernels, the SVGD dynamics close exactly on first and second moments, yielding genuine uniform-in-physical-time parametric propagation-of-chaos rates in finite-dimensional Stein-feature metrics. We then prove a conjugacy principle showing that these feature-level estimates transfer to conjugate target-kernel pairs under orientation-preserving diffeomorphisms, thereby extending the theory to broad classes of nonlinear, including multimodal, targets. Together, these results highlight the contrast between generic distributional metrics, for which our general approach yields logarithmic rates, and closed finite-dimensional Stein observables, for which parametric $N^{-1/2}$ propagation-of-chaos rates persist uniformly in time.
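For readers unfamiliar with the finite-particle system the abstract refers to, the following is a minimal sketch of one discrete-time SVGD update. It is not the paper's continuous-time flow or its matrix-valued kernels: the scalar RBF kernel, the bandwidth `h`, the step size `step`, and the function names are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(x, h=1.0):
    """Return K[j, i] = k(x_j, x_i) and grad_K[j, i] = gradient of k(x_j, x_i) in x_j."""
    diff = x[:, None, :] - x[None, :, :]            # diff[j, i] = x_j - x_i
    sq_dist = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq_dist / (2.0 * h ** 2))
    grad_K = -diff / h ** 2 * K[:, :, None]
    return K, grad_K

def svgd_step(x, score, step=0.1, h=1.0):
    """One Euler step of SVGD for particles x of shape (n, d).

    score(x) must return the gradient of the log target density at each particle.
    """
    n = x.shape[0]
    K, grad_K = rbf_kernel(x, h)
    s = score(x)
    # phi[i] = (1/n) * sum_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K.T @ s + grad_K.sum(axis=0)) / n
    return x + step * phi
```

The first term in `phi` drives particles toward high-density regions of the target, and the second is a repulsion term that keeps them apart. For a standard Gaussian target one would pass `score=lambda z: -z`.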

artificial intelligence, kernel, machine learning

2607.00149

Country: North America > United States (0.67)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

Bay, Yong Yi, Yearick, Kathleen A.

GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity

Three of the most popular methods for training language models to reason look like three different tricks. They are not. All three adjust a single number: the group standard deviation, which reflects how much a prompt's sampled answers disagree. When such a model is trained, it answers each problem many times, and an automatic checker marks every answer right or wrong. The standard deviation of those marks measures the disagreement: largest when the answers split evenly between right and wrong, and zero when they all agree. Group Relative Policy Optimization (GRPO) divides by this number, GRPO Done Right (Dr. GRPO) drops the division, and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) discards the groups where it is zero. Each is presented as its own fix, yet this paper proves they are three settings of one dial. That dial is not cosmetic: for right-or-wrong rewards, the disagreement is exactly the size of the training update, a result the paper calls the group-standard-deviation identity. A split group teaches the most, while a unanimous group teaches nothing and falls silent. The same result says which problems deserve the most weight and how many tries each one needs. This paper confirms the intuition on a large real difficulty dataset (Big-Math) and in a controlled training run. What looks like a harmless normalization step is the dial that decides where learning happens and how strongly.
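The three treatments of the group standard deviation described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the function names, the `mode` strings, and the `eps` stabiliser are hypothetical, and only the advantage computation is shown (clipping and the rest of each method are omitted).

```python
import numpy as np

def binary_group_std(p):
    """Standard deviation of 0/1 rewards when a fraction p of the answers is correct."""
    return float(np.sqrt(p * (1.0 - p)))

def group_advantages(rewards, mode="grpo", eps=1e-8):
    """Per-answer advantages for one prompt's group of sampled answers.

    mode="grpo":    centre the rewards, then divide by the group std
    mode="dr_grpo": centre the rewards only (no division)
    mode="dapo":    return None (discard the group) when the std is zero,
                    otherwise normalise as in "grpo"
    """
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()
    std = r.std()
    if mode == "grpo":
        return centered / (std + eps)
    if mode == "dr_grpo":
        return centered
    if mode == "dapo":
        return None if std == 0.0 else centered / (std + eps)
    raise ValueError(f"unknown mode: {mode}")
```

For a group of 0/1 marks the standard deviation is sqrt(p(1 - p)), which peaks at 0.5 for an even split and is zero for a unanimous group. In the unanimous case the centred rewards are all zero, so the group contributes no update under any of the three modes.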

grpo, machine learning, natural language

2607.00152

Country: North America > United States > Illinois (0.40)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Huthasana, Krishna Harsha Kovelakuntla, Olama, Alireza, Lundell, Andreas

Entropy-Regularized Probabilistic Gates for Sparse Model Discovery in Scarce-Data Federated Learning

Federated Learning (FL) is a distributed machine learning (ML) paradigm in which multiple clients collaborate without sharing data. FL is challenging under data heterogeneity and partial client participation. Learning sparse models is useful for communication and computational efficiency in FL, but it is especially difficult in the small-sample high-dimensional regime ($d \gg N$), where optimization can yield parameter configurations that fail to generalize to unseen test data. Magnitude-based pruning does not account for uncertainty or exploration in the parameter space, whereas a formulation with probabilistic gates and an L0 constraint allows sampling from competing sparse configurations during training. In this work, we study entropy regularization of gate distributions as a mechanism to maintain uncertainty in sparse federated optimization by preventing early commitment to a sparse support. We examine its impact under data heterogeneity, client participation heterogeneity, and sparsity. Experiments on synthetic and real-world benchmarks show consistent improvements over federated iterative hard thresholding (Fed-IHT) and pruning after dense federated averaging (FedAvg) training, both in statistical performance on test data and in sparsity recovery accuracy.
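As a rough illustration of the quantities involved, the sketch below treats each gate as an independent Bernoulli variable and computes the expected number of open gates and the total gate entropy. The Bernoulli parameterisation through logits and all function names are assumptions. The abstract does not specify the gate distribution or how the entropy term enters the objective.

```python
import numpy as np

def gate_probs(logits):
    """Map real-valued gate logits to Bernoulli 'open' probabilities."""
    return 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))

def gate_stats(logits):
    """Return (expected L0 norm, total entropy in nats) of independent Bernoulli gates."""
    p = np.clip(gate_probs(logits), 1e-12, 1.0 - 1e-12)  # avoid log(0)
    expected_l0 = float(p.sum())
    entropy = float(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)).sum())
    return expected_l0, entropy

def sample_mask(logits, rng):
    """Draw one sparse configuration: True where the gate is open."""
    p = gate_probs(logits)
    return rng.random(p.shape) < p
```

Gates with probabilities near 0 or 1 have almost no entropy, which corresponds to having committed to a support. Rewarding entropy keeps probabilities away from those extremes so that competing supports continue to be sampled.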

artificial intelligence, machine learning, sparsity

2607.00275

Genre: Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Therapeutic Area > Oncology > Leukemia (0.47)
Health & Medicine > Therapeutic Area > Hematology (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Raeth, Kornelius, Ludwig, Nicole

Decision-Aware Training for Sample-Based Generative Models

Abstract: Sample-based generative models are increasingly used for probabilistic forecasting in high-stakes decision settings, yet their training objectives are [...]. These models are commonly trained with strictly proper scoring rules, such as the energy score, which allocate their training signal in proportion to data density, with no awareness of where forecast errors are most costly for downstream decisions. [...] the energy score objective with a differentiable decision loss that directly penalises the cost incurred by acting on the model's forecast. This combined loss is theoretically grounded, as the decision loss is itself a proper scoring rule. [...] score acts as that anchor, preventing the model from collapsing outside cost-sensitive regions. Our method is theoretically grounded and leads to better downstream decisions while retaining full probabilistic forecasts, as validated on synthetic and real-world forecasting tasks.

Introduction: [...] scoring rules distribute the training gradient in proportion to data density, with no awareness of the decision maker's cost structure. The model's limited capacity is allocated globally, leaving decision-critical regions of the output space potentially underserved. Given a forecast, a decision maker with cost function c(a,y), of action a and outcome y, selects the action that minimises expected cost under the forecast distribution; a point forecast is insufficient to evaluate this expectation. A good forecast [...]. Crucially, the observed cost of the optimal action is itself a proper scoring rule (Hartline et al., 2025; Kleinberg et al., 2023), placing it in the same family as the energy score, which licenses their combination as a theoretically well-founded training [...]. [...]tion based on a temperature forecast, balancing asset loss against the cost of intervention. In the weather domain, state-of-the-art forecasting systems (Lang et al., 2024; Price et al., 2023) are trained with strictly proper scoring rules [...] score reduces to the continuous ranked probability score (CRPS), widely used in meteorological forecast verification. Both model classes introduced above are commonly trained by minimising strictly proper [...]. A gradient analysis showing which regions benefit from the decision loss and why, based on the cost function structure. [...]sion calibration.
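To make the two ingredients concrete, here is a small sketch of a sample-based energy score and of the realised cost of the action that minimises expected cost under the forecast samples. It rests on several assumptions: the action is chosen by a non-differentiable search over a finite grid (the paper describes a differentiable decision loss), the asymmetric cost function is a made-up example, and the weight `lam` and all function names are hypothetical.

```python
import numpy as np

def energy_score(samples, y):
    """Sample estimate of the energy score: E|X - y| - 0.5 * E|X - X'|."""
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                       # treat scalars as 1-D vectors
    y = np.atleast_1d(np.asarray(y, dtype=float))
    term_obs = np.linalg.norm(x - y, axis=1).mean()
    pairwise = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return float(term_obs - 0.5 * pairwise.mean())

def asymmetric_cost(a, y, under=4.0, over=1.0):
    """Hypothetical cost c(a, y): under-provisioning costs more than over-provisioning."""
    return under * max(y - a, 0.0) + over * max(a - y, 0.0)

def decision_cost(samples, y, actions, cost):
    """Pick the action with least mean cost over the samples; return it and its realised cost."""
    expected = [np.mean([cost(a, s) for s in samples]) for a in actions]
    best = actions[int(np.argmin(expected))]
    return best, cost(best, y)

def combined_loss(samples, y, actions, cost, lam=1.0):
    """Energy score plus lam times the realised decision cost (lam is an assumed weight)."""
    _, realised = decision_cost(samples, y, actions, cost)
    return energy_score(samples, y) + lam * realised
```

In one dimension `energy_score` coincides with the sample-based CRPS. With samples 0, 1, 2, 3 and the cost above, the chosen action is 3 rather than the sample median, because shortfalls are four times as expensive as excess.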

artificial intelligence, machine learning, natural language

2607.01171

Country: Europe > Germany (0.28)

Genre: Research Report (0.81)

Industry: Energy > Renewable > Wind (0.94)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.64)

Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization

Liu, Xuefeng, Cao, Mingxuan, Huang, Qinan, Brettin, Thomas, Stevens, Rick, Cong, Le

Scientific reasoning is an increasingly important capability of large language models, yet improving the robustness and efficiency of training such reasoning remains a key open challenge. We study this problem in instruction-based molecular optimization, where answer-only supervised fine-tuning (SFT) collapses multi-step reasoning and reinforcement learning with verifiable rewards (RLVR) suffers from sparse feedback. Reference-guided Policy Optimization (RePO) mitigates both by anchoring policy updates to dataset-provided references, but its effectiveness is tightly coupled to reference quality: weak or misaligned references impose a performance ceiling. To overcome this ceiling, we propose active reasoning, a paradigm in which the policy actively decides, on a per-instance basis, when to imitate a reference and when to reinforce its own discoveries, while continuously upgrading what it imitates. We instantiate this paradigm as Active Group Relative Policy Optimization (Active-GRPO), realized through two coupled mechanisms: active imitate-reinforce and active referencing. The former performs imitation learning when the reference still outperforms the policy's own candidates, and shifts to self-improvement via reinforcement learning once the policy has generated molecules that surpass the reference. The latter continuously upgrades the reference itself by replacing it with the best policy-generated candidate discovered so far, progressively raising the imitation target and ensuring that reference guidance remains informative, rather than restrictive, throughout training. Across TOMG-Bench MOLOPT, Active-GRPO improves average SR Sim from 0.0959 for GRPO and 0.1665 for RePO to 0.1773 under matched three-seed evaluation, with statistically significant gains on LogP, MR, and QED.
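The per-instance switching and reference upgrade described above can be sketched as a simple rule. This covers only the control logic, with scores assumed to be scalar values where higher is better; the function name, the mode labels, and the strict-inequality tie-breaking are assumptions, and the actual imitation and reinforcement losses are not shown.

```python
def active_step(reference, ref_score, candidates, scores):
    """Decide the training mode for one instance and upgrade the reference if beaten.

    Returns (mode, reference, ref_score), where mode is "imitate" while the
    reference still beats every policy candidate and "reinforce" otherwise.
    """
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    best, best_score = candidates[best_idx], scores[best_idx]
    if best_score > ref_score:
        # Active referencing: the best candidate found so far becomes the new reference.
        return "reinforce", best, best_score
    return "imitate", reference, ref_score
```

Calling this once per instance per training step makes the reference score non-decreasing over time, which is the sense in which the imitation target keeps rising.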

active-grpo, large language model, machine learning

2607.00531

Genre: Research Report > Experimental Study (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.93)

Au, Kwok Chun, Block, Adam

Training for the Model You Return: Improving Optimization for Iterate-Averaged Language Models

arXiv.org Machine Learning, Jul-1-2026

Many modern Language Model (LM) pipelines return an averaged model, such as an exponential moving average of the training iterates, rather than the final iterate itself. This raises a fundamental question: given that we will return an iterate average, how should we change training to improve the performance of this average? We study this question by formulating optimizer design for the iterate-average estimator as an optimal-control problem. In a continuous-time stochastic quadratic model, we solve for the control strategy that minimizes the error of the returned average subject to a penalty on the size of the intervention. A practical approximation to this controller yields PACE, a lightweight wrapper around AdamW that pulls the live weights toward their exponential moving average with a clipped, per-coordinate control strength. We prove that a stylized version of PACE converges at the standard stochastic convex optimization rate, up to a factor depending on the averaging rule, while in the quadratic setting it can strictly improve the limiting squared error of the iterate-average estimator and can do so by an arbitrarily large factor on some instances. Empirically, our results suggest that PACE improves over AdamW and EMA-evaluated AdamW in supervised fine-tuning of 1-2B parameter LMs and in GPT-2 pretraining on FineWeb for a wide range of learning rates, decay schedules, and other hyperparameters.

artificial intelligence, machine learning, natural language

2606.25086

Genre: Research Report > New Finding (0.65)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

arXiv.org Machine Learning, Jul-1-2026

INFUSER: Influence-Guided Self-Evolution Improves Reasoning

Chen, Siyu, Lu, Miao, Wu, Beining, Sheen, Heejune, Zhang, Fengzhuo, Li, Shuangning, Li, Zhiyuan, Blanchet, Jose, Wang, Tianhao, Yang, Zhuoran

Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervision. Yet existing methods either depend on extensively curated or teacher-generated training data, or, when the generator runs unsupervised, reward it by a difficulty heuristic that need not improve the solver. We introduce INFUSER, an iterative co-training framework with two co-evolving roles: a Generator that drafts questions and reference golden answers from a pool of unstructured, automatically collected documents, and a Solver that improves by training on them. The solver is trained with standard correctness rewards against the generator-provided answers, while the generator is rewarded by an optimizer-aware influence score that measures whether each proposed question would actually improve the solver on the target distribution. Because this continuous, noisy influence score is poorly served by standard GRPO, we propose DuGRPO, a dual-normalized variant of GRPO, for generator training. Together, these turn the document pool into an adaptive curriculum that favors questions useful to the current solver, not just hard ones. On Qwen3-8B-Base, INFUSER outperforms strong self-evolution baselines with over 20% relative improvement on Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm each design choice is necessary, and two extensions, applying INFUSER to an instruction-finetuned anchor and augmenting it with rule-verifiable RLVR data, further demonstrate the flexibility and generalizability of the framework. Code is available at https://github.com/FFishy-git/INFUSER.

generator, large language model, machine learning

2606.09052

Country: North America > United States > California (0.27)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.92)
Information Technology > Game Theory (0.92)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Dereziński, Michał, Dong, Xiaoyu

How AI settled the complexity of the oldest SGD algorithm

arXiv.org Machine Learning, Jun-30-2026

An essential catalyst for the remarkable breakthroughs in AI that led to the modern large language models (LLMs) such as ChatGPT and Gemini has been the algorithms used to train these models on massive datasets. While the LLM architectures have gotten progressively more complex, the training algorithms have stayed relatively simple, and in fact, they have all been based on the decades-old paradigm of stochastic gradient descent (SGD). The key idea behind SGD is that in order to minimize a certain objective function (such as an LLM's error on the training data), it suffices to access only a noisy estimate of that objective at any given time (e.g., based on a small sample of the data) while making incremental progress towards the solution. This is essential for LLM training, as the datasets have become so massive one could not hope to perform computations on everything all at once. Commonly attributed to a 1951 paper by Robbins and Monro [34], SGD has seen a resurgence of interest over the last 20 years by AI researchers and computer scientists striving to understand its effectiveness, leading to numerous variants and extensions used in modern LLMs [12, 9], most notably the Adam algorithm [25]. As a result, we have gained a robust mathematical understanding of the computational complexity of SGD algorithms in a wide range of settings (e.g., see [11, 15, 5, 17]). Yet, despite this progress there is a surprising gap in the understanding of SGD: The complexity of an algorithm proposed by Stefan Kaczmarz in 1937 [24] for solving a system of linear equations - the oldest published example of an SGD algorithm, which predates Robbins and Monro's paper by over a decade - has not been settled.
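For concreteness, a Kaczmarz-type iteration for a consistent linear system Ax = b can be sketched as below. Each step picks one equation and projects the current iterate onto its solution hyperplane, which is a stochastic gradient step on that equation's squared residual with step size 1/||a_i||^2. The row-sampling rule (probability proportional to squared row norm), the iteration count, and the function name are assumptions, and this summary does not say which variant the paper analyses.

```python
import numpy as np

def randomized_kaczmarz(A, b, iters=5000, seed=0):
    """Approximate a solution of the consistent system A x = b by row projections."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_sq = np.sum(A ** 2, axis=1)          # squared norm of each row
    probs = row_sq / row_sq.sum()            # assumed sampling rule
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.choice(m, p=probs)
        # Project x onto the hyperplane {z : A[i] @ z = b[i]}.
        x = x + (b[i] - A[i] @ x) / row_sq[i] * A[i]
    return x
```

Only one row of A is touched per step, which is the property that links the method to SGD on large datasets.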

large language model, machine learning, natural language

2606.29593

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

arXiv.org Machine Learning, Jun-30-2026

Optimizer Memory Makes Shuffle Order a First-Order Source of Fine-Tuning Noise

Sweeney, John

Shuffle order can be a larger source of fine-tuning noise than a memoryless analysis predicts: fixed-clock optimizer memory makes local equal-multiset contrasts first order in the learning rate rather than second order, and the resulting order channel can be large enough for a single seed to flip a close A/B comparison. We isolate this mechanism and derive a fit-free way to size the noise it produces. For a memoryless optimizer, reordering an equal multiset has no first-order endpoint term; the leading local contrast is the $O(η^2)$ gradient bracket. Fixed-clock optimizers such as AdamW are different. Their moment buffers, preconditioner state, and de-biasing counters advance with the step index rather than with the learning-rate-scaled time $τ=ηk$, so the same gradient can receive a position-dependent endpoint weight. For any fixed finite measurement window, a lifted-state expansion gives an $O(η)$ equal-multiset contrast whenever the first-order replay coefficient is nonzero, while regular and clock-matched controls remain $O(η^2)$; a bare fixed-$β$ momentum buffer is already enough. A bitwise-deterministic replay from one warmed optimizer state isolates the mechanism, giving order-variance slopes 1.83 for AdamW, 2.00 for fixed-$β$ momentum, and 4.00 for SGD; matching the memory clock to $τ$ restores the regular exponent. For AdamW with a frozen preconditioner, the same impulse-weight kernel gives a closed-form asymptotic order-variance floor after the local potentials are measured, with no fitted coefficients. The result is local to the measurement window (independent reshuffling can average the channel across windows), but it yields order-noise error bars, positional attribution weights, and a seed-budget criterion for fine-tuning comparisons.
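A toy calculation shows how a momentum buffer gives the same gradient a position-dependent endpoint weight. The sketch assumes frozen, state-independent scalar gradients, so the memoryless optimizer's endpoint does not depend on order at all (its O(η^2) term comes from state dependence, which this toy ignores). The function names and the values of `lr` and `beta` are illustrative, and this is not the paper's lifted-state expansion.

```python
def sgd_endpoint(grads, lr):
    """Plain SGD on a fixed sequence of gradients, starting from x = 0."""
    x = 0.0
    for g in grads:
        x -= lr * g
    return x

def momentum_endpoint(grads, lr, beta=0.9):
    """Fixed-beta momentum: the buffer advances once per step, whatever the learning rate."""
    x, m = 0.0, 0.0
    for g in grads:
        m = beta * m + g
        x -= lr * m
    return x

def order_gap(endpoint_fn, grads, lr):
    """Endpoint difference between a gradient sequence and its reversal."""
    return endpoint_fn(list(reversed(grads)), lr) - endpoint_fn(list(grads), lr)
```

Over a two-step window with gradients g1 then g2, momentum ends at -lr * ((1 + beta) * g1 + g2), so the earlier gradient carries the larger weight. Swapping the order moves the endpoint by lr * beta * (g1 - g2), which is linear in the learning rate, whereas the plain SGD endpoint -lr * (g1 + g2) is unchanged.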

artificial intelligence, coefficient, machine learning

2606.29554

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)